Cellular Simultaneous Recurrent Neural Network (SRN) has been shown to be a function
approximator more powerful than the MLP. This means that the complexity of an MLP would
be prohibitively large for some problems, while an SRN could realize the desired mapping
within acceptable computational constraints. The speed of training of complex recurrent
networks is crucial to their successful application. The present work improves the previous
results by training the network with the extended Kalman filter (EKF). We implemented a
generic Cellular SRN and applied it to two challenging problems: 2D maze navigation and
a subset of the connectedness problem. The speed of convergence has been improved by
several orders of magnitude in comparison with the earlier results in the case of maze
navigation, and superior generalization has been demonstrated in the case of connectedness.
The implications of these improvements are discussed.

*Roman Ilin is with the Department of Computer Science at The University of Memphis, Memphis, TN
38117. E-mail: rilin@memphis.edu

†Robert Kozma is with the Department of Computer Science at The University of Memphis, Memphis, TN
38117. E-mail: rkozma@memphis.edu

‡Paul J. Werbos, Room 675, National Science Foundation, Arlington, VA 22230. E-mail: pwerbos@nsf.gov

§The opinions expressed in this paper are those of the authors and do not necessarily reflect the views of
their employers, in particular NSF.
1 Introduction 



Artificial neural networks, inspired by the enormous capabilities of living brains, are
one of the cornerstones of today's field of artificial intelligence. Their applicability to
real-world engineering problems has become unquestionable in recent decades; see, for
example, [1]. Yet most of the networks used in real-world applications use the feed-forward
architecture, which is a far cry from the massively recurrent architecture of biological
brains. The widespread use of the feed-forward architecture is facilitated by the
availability of numerous efficient training methods. However, the introduction of recurrent
elements makes training more difficult and even impractical for most non-trivial cases.

The SRN's have been shown to be more powerful function approximators by several
researchers ([2], [3]). It has been shown experimentally that an arbitrary function generated
by an MLP can always be learned by an SRN. However, the opposite was not true: not
all functions given by an SRN could be learned by an MLP. These results support the idea
that recurrent networks are essential in harnessing the power of brain-like computing.

It is well known that MLPs and a variety of kernel-based networks (like RBF) are
universal function approximators, in some sense. Andrew Barron [4] proved that MLPs
are better than linear basis function systems like Taylor series in approximating smooth
functions; more precisely, as the number of inputs N to a learning system grows, the
required complexity for an MLP only grows as O(N), while the complexity for a linear basis
function approximator grows exponentially, for a given degree of accuracy in approximation.
However, when the function to be approximated does not live up to the usual concept of
smoothness, or when the number of inputs becomes even larger than what an MLP can
readily handle, it becomes ever more important to use a more general class of neural
network.

The area of intelligent control provides examples of very difficult functions to be tackled
by ANNs. Such functions arise as solutions to multistage optimization problems, given by
the Bellman equation (Eq. (8)). The design of non-linear control systems, also known as
"Adaptive Critics", presupposes the ability of the so-called "Critic network" to approximate
the solution of the Bellman equation. See [5] for an overview of adaptive critic designs. Such
problems are also classified as Approximate Dynamic Programming (ADP). A simple example
of such a function is the 2D maze navigation problem, considered in this contribution. See [6]
for an in-depth overview of ADP and the maze navigation problem. The applications of EKF
for NN training have been developed by researchers in the field of control [7], [8], [5].

The classic challenge posed by Rosenblatt to perception theory is the recognition of
topological relations [9]. Minsky and Papert [10] have shown that such problems
fundamentally cannot be solved by perceptrons because of their exponential complexity. The
multi-layer perceptrons are more powerful than Rosenblatt's perceptron, but they are also
claimed to be fundamentally limited in their ability to solve topological relation
problems [11]. An example of such a problem is the connectedness predicate. The task is to
determine whether the input pattern is connected regardless of its shape and size.

The two problems described above pose fundamental challenges to the new types of
neural networks, just as the XOR problem posed a fundamental challenge to the
perceptrons, which could be overcome only by the introduction of the hidden layer and thus
effectively moving to a new type of ANN.

In this contribution, we present the Cellular Simultaneous Recurrent Network (CSRN)
architecture. This is a case of a more generic architecture called ObjectNet, see [12],
chapter 6, page 120. We use the Extended Kalman Filter (EKF) methodology for training
our networks and obtain very encouraging results. For the first time an efficient training
methodology is applied to the complex recurrent network architecture. Extending the
preliminary result introduced in [13], the present study addresses not only learning but
also generalization of the network on two problems: maze navigation and connectedness.
An improvement in the speed of learning by several orders of magnitude as a result of using
EKF is also demonstrated. We consider the results introduced in this work an initial
demonstration of the proposed learning principle, which should be thoroughly studied and
implemented in various domains.

The rest of this paper is organized as follows. Section 2 describes the calculation of
derivatives in the recurrent network. Section 3 describes the CSRN architecture. Section 4
gives the EKF formulas. Section 5 describes the operation of a generic CSRN application.
Sections 6 and 7 describe the two problems addressed by this contribution and give the
simulation results. Section 8 contains the discussion and conclusions.
2 Backpropagation in complex networks 

The backpropagation algorithm is the foundation of neural network applications [14], [15].
Backpropagation relies on the ability to calculate the exact derivatives of the network
outputs with respect to all the network parameters.

Real-life applications often demand complex networks with a large number of parameters.
In such cases, the rule of ordered derivatives [14], [16] makes it possible to obtain the
derivatives in a systematic manner. This rule also simplifies the calculations by
breaking a complex network into simple building blocks, each characterized by its inputs,
outputs and parameters. If the derivatives of the outputs of a simple building block with
respect to all its internal parameters and inputs are known, then the derivatives of the
complete system can be easily obtained by backpropagating through each block.

Suppose that the network consists of N units, or subnetworks, which are updated in
order from 1 to N. We would like to know the derivatives of the network outputs with
respect to the parameters of each unit. In the general case the final calculation for any
network output j is a simple summation:

$$\frac{\partial^+ z_j}{\partial \alpha} = \sum_{i=1}^{N} \sum_{k=1}^{n} \delta_k^i \, \frac{\partial z_k^i}{\partial \alpha} \qquad (1)$$

Here $\alpha$ stands for any parameter, $i$ is the unit number, and $k$ is the index of the
output of the current unit. $\delta_k^i$ is the derivative with respect to the input of the
unit that is connected to the $k$-th output of the $i$-th unit. Note that the $k$-th output
of the current unit can feed into several subsequent units, and so the "delta" will be a sum
of the "deltas" obtained from each unit. Also, the $\delta$'s are set externally, as if the
network were a part of a bigger system. If we simply want the derivatives of the outputs,
set $\delta = 1$. We provide an example of this calculation in Appendix A.

Let us denote the outputs of our network by $z_i$. Ultimately we are interested in obtaining
the derivatives of these outputs w.r.t. all the internal parameters. This is equivalent to
calculating the Jacobian matrix of the system. For example, if we have two outputs and
three internal parameters $a$, $b$, and $c$, the Jacobian will be

$$C = \begin{pmatrix} \dfrac{\partial^+ z_1}{\partial a} & \dfrac{\partial^+ z_1}{\partial b} & \dfrac{\partial^+ z_1}{\partial c} \\[2ex] \dfrac{\partial^+ z_2}{\partial a} & \dfrac{\partial^+ z_2}{\partial b} & \dfrac{\partial^+ z_2}{\partial c} \end{pmatrix} \qquad (2)$$

This matrix can be used to adjust the system's parameters using various methods such
as gradient descent or the Extended Kalman Filter.

So far we have considered multi-layered feed-forward networks. The methodology described
above can be extended to recurrent networks. Let us consider a feed-forward network with
recurrent connections that link some of its outputs to some of its inputs. Suppose that
the network is updated for N steps. We would like to calculate the derivatives of the final
network outputs w.r.t. the weights of the network. This calculation is a case of Eq. (1).
Suppose that the network has $m$ inputs and $n$ outputs. We assume that the expressions for
the derivatives of all outputs w.r.t. each input and each network weight,
$\partial z_k / \partial \alpha$, $k = 1..n$, and $\partial z_k / \partial x_p$, $k = 1..n$,
$p = 1..m$, are known, and we will also denote the ordered derivatives
$\partial^+ z_k / \partial \alpha$ by $F_k^\alpha$. Then the full derivatives calculation is
given by the algorithm in Fig. 1. Note that we omit the loop over all the weight parameters
$\alpha$ to improve readability. The result of this algorithm is the Jacobian matrix of the
network after N iterations.
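
As an illustration of the recurrence described above, the following is a minimal sketch, not the algorithm of Fig. 1 itself. It assumes a hypothetical single-layer cell z <- tanh(W z + U x) in place of the paper's unit, and treats only the entries of W as the parameters. At each iteration the ordered derivative is the direct derivative plus the derivative propagated through the fed-back outputs.

```python
import numpy as np

def srn_jacobian(W, U, x, n_steps):
    """Iterate z <- tanh(W z + U x) and carry the ordered derivatives along.

    F[k, a] holds d+ z_k / d alpha_a, where alpha_a is the a-th entry of W
    in row-major order. The tanh cell is an illustrative assumption.
    """
    n = W.shape[0]
    z = np.zeros(n)
    F = np.zeros((n, n * n))
    for _ in range(n_steps):
        z_new = np.tanh(W @ z + U @ x)
        slope = 1.0 - z_new ** 2                 # derivative of tanh
        # Direct term: d(W z)_k / dW[r, c] equals z[c] when k == r, else 0.
        direct = np.kron(np.eye(n), z[None, :])
        # Recurrent term: the fed-back outputs z also depend on the weights.
        F = slope[:, None] * (direct + W @ F)
        z = z_new
    return z, F
```

The returned matrix F plays the role of the Jacobian after N iterations; a finite-difference comparison is a convenient sanity check for any such implementation.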

3 Cellular Simultaneous Recurrent Networks 

SRN's can be used for static functional mapping, similarly to the MLP's. They differ from
the more widely known time lagged recurrent networks (TLRN) because the input in an SRN
is applied over many time steps and the output is read after the initial transients have
disappeared and the network is in an equilibrium state. The most critical difference between
TLRN and SRN is whether the network output is required at the same time step (TLRN)
or after the network settles to an equilibrium (SRN).

Many real-life problems require processing patterns that form a 2D grid. For instance,
such problems arise in image processing or in playing a game of chess. In those cases the
structure of the neural network should also become a 2D grid. If we make all the elements
of the grid identical, the resulting cellular neural network benefits from a greatly reduced
number of independent parameters.

The combination of cellular structure with SRN creates very powerful function
approximators. We developed a CSRN package that can be easily adapted to various problems.
The architecture of the network is given in Fig. 2. The input is always a 2D grid. The
number of cells in the network equals the size of the input. The number of outputs also
equals the size of the input. Since most problems require only a few outputs, we
add an arbitrary output transformation to the network. It has to be differentiable, but it
does not have to have adjustable parameters. The training in our applications occurs only
in the SRN. The cells of the network are connected through neighbor links. Each cell has
four neighbors and the edges of the network wrap around.

The cell of the CSRN in our implementation is a generalized MLP (GMLP) [17], shown in Fig. 3.
Each non-input node of the GMLP is linked to all the subsequent nodes, thus generalizing
the idea of a multi-layered network. The recurrent connections come from the output nodes
of this cell and from its neighboring cells. It is an important feature of this architecture
that each cell has the same weights, which makes it possible to build arbitrarily large
networks without increasing the number of weight parameters.
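
The weight sharing and wrap-around neighborhood can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' package: each cell is a single tanh layer instead of a GMLP, the four neighbor states are summed before being weighted, and the names `W_in`, `W_self`, and `W_nbr` are hypothetical.

```python
import numpy as np

def csrn_settle(inputs, W_in, W_self, W_nbr, n_steps=20):
    """Run a weight-shared cellular recurrent update on a 2D grid.

    inputs: array of shape (rows, cols, n_in), one input vector per cell.
    Returns the cell states, shape (rows, cols, n_rec), after n_steps updates.
    """
    rows, cols, _ = inputs.shape
    n_rec = W_self.shape[0]
    state = np.zeros((rows, cols, n_rec))
    for _ in range(n_steps):
        # np.roll wraps around the edges, so every cell has four neighbors.
        nbr_sum = sum(np.roll(state, shift, axis)
                      for shift, axis in ((1, 0), (-1, 0), (1, 1), (-1, 1)))
        # The same three weight matrices are applied at every cell.
        state = np.tanh(inputs @ W_in.T + state @ W_self.T + nbr_sum @ W_nbr.T)
    return state
```

Because the weights are shared, the same three matrices serve a 5 by 5 grid and a 100 by 100 grid; only the amount of computation grows with the grid.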

4 Extended Kalman Filter for Network Training 

Kalman filters (KF) originated in signal processing. They are a computational technique
for estimating the hidden state of a system based on observable measurements.
See [18], [19] for a derivation of the classical Kalman filter formulas based on the
theory of the multivariate normal distribution.

In the case of neural network training, we are faced with the problem of determining
the weights in such a way that the measured outputs of the network are as close
to the target values as possible. The network can be described as a dynamical system with
its hidden state vector $W$ formed by all the values of the network weights, and the
observable measurements vector formed by the values of the network outputs $Y$. It is
sometimes convenient to form a full state vector $S$ that consists of both hidden and
observable parts. Such a formulation can be used in the derivation of the Kalman filter [18].
In this paper we follow the convention in the literature, which refers to $W$ as the state
vector [20]. Note that the outputs of the network can be expressed in terms of the weights as

$$Y = C W \qquad (3)$$

where $C$ is the Jacobian matrix of the network evaluated around the current output
vector $Y$. We assume that the state $S$ is normally distributed, and we are interested in
the estimate of $W$ based on the knowledge of $Y$ and the underlying dynamical model of the
system, which is simply $Y(i+1) = t$ and $W(i+1) = W(i)$, where $t$ is the target output
of the network. Suppose the covariance matrix of $W$ is given by $K$, and the measurement
noise covariance matrix by $R$. The measurement noise is assumed to be normally distributed
with zero mean. Then the Kalman update is given by the following formulas.

$$W(i+1) = W(i) + K(i) C(i)^T \left[ C(i) K(i) C(i)^T + R(i) \right]^{-1} \left( t - Y(i) \right) \qquad (4)$$

$$K(i+1) = K(i) - K(i) C(i)^T \left[ C(i) K(i) C(i)^T + R(i) \right]^{-1} C(i) K(i) + Q(i) \qquad (5)$$

We introduced the index $i$ to denote the current training step. The matrix $Q(i)$ is the
process noise covariance matrix. It represents our assumptions about the distribution of
the true values of $W$. Formulas (4) and (5) are the celebrated Extended Kalman filter
formulas, which can be found throughout the literature; see for example [21], [20], [8]. If
we look closely at (4), we can see the similarity between the EKF update and the regular
gradient descent update. In both cases we have some matrix coefficient multiplied by
the difference $(t - Y(i))$. In the case of gradient descent, the coefficient is simply $C(i)$
multiplied by some learning rate. In the case of EKF, the coefficient is more complex, as it
involves the covariance matrix, which is the key to the efficiency of EKF.
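
A single EKF weight update can be sketched as below. This is a minimal illustration in the standard textbook form, assuming the Kalman gain $G = K C^T (C K C^T + R)^{-1}$ and symmetric $K$ and $R$; the function and argument names are hypothetical and are not taken from the authors' Matlab code.

```python
import numpy as np

def ekf_step(w, K, C, err, R, Q):
    """One EKF update of the weight vector and its covariance.

    w: weights (p,), K: weight covariance (p, p), C: Jacobian dY/dw (s, p),
    err: t - Y (s,), R: measurement noise (s, s), Q: process noise (p, p).
    """
    A = C @ K @ C.T + R                 # innovation covariance, (s, s)
    # Gain K C^T A^{-1}; solve() avoids forming the inverse explicitly
    # (valid because K and A are symmetric).
    G = np.linalg.solve(A, C @ K).T     # (p, s)
    w_new = w + G @ err
    K_new = K - G @ C @ K + Q
    return w_new, K_new
```

With a large initial $K$ and a small $R$, one step on a linear model already drives the outputs close to the targets, which is the behavior gradient descent needs many steps to reach.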

The process noise $Q$ can be safely assumed to be 0, even though setting it to a non-zero
value helps prevent $K$ from losing positive definiteness, which would destabilize the
filter. The measurement noise $R$ also plays an important role in fine tuning the EKF by
accelerating the speed of learning. The proper functioning of EKF depends on the assumption
that the state vector $S$ is normally distributed. This assumption usually does not hold in
practice. However, adding the normally distributed noise described by $R$ helps overcome
this difficulty. $R$ is usually chosen to be a random diagonal matrix. The values on the
diagonal are annealed as the network training progresses, so that by the end of training
the noise is insignificant. It turned out that the way $R$ is annealed has a significant
effect on the rate of convergence. After experimenting with different functional forms we
used the following formula:

$$R(i) = a \log\left( b\, \delta(i)^2 + 1 \right) I \qquad (6)$$

where $\delta(i)^2$ is the squared error, $\delta(i) = t - Y(i)$, and $I$ is the identity
matrix. The constants $a$ and $b$ were determined experimentally. We used $a = b = 0.001$,
which gave the reasonably good results presented in the next sections. This functional form
works better than linear annealing. Making the measurement noise a function of the error
results in fast and reliable learning.
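
The error-dependent noise schedule can be written in a few lines. In this sketch the squared error is assumed to be the sum of squared components of the error vector, which is one reading of the text; the function name is hypothetical.

```python
import numpy as np

def measurement_noise(err, a=0.001, b=0.001):
    """Return R = a * log(b * |err|^2 + 1) * I for the error vector err = t - Y."""
    squared_error = float(np.dot(err, err))   # assumption: summed over outputs
    return a * np.log(b * squared_error + 1.0) * np.eye(err.size)
```

The diagonal shrinks toward zero as the error vanishes, so the noise is large early in training and insignificant at the end.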

The algorithm described above is suitable for learning one pattern. Learning multiple
patterns creates additional challenges. The patterns can be learned one by one or in a
batch. In the present work we used the batch mode, which results in efficient learning but
is computationally demanding. To explain this method, we write Eq. (4) more compactly as

$$\delta W = G \delta \qquad (7)$$

where $G = K C^T \left( C K C^T + R \right)^{-1}$ and the time step index is omitted for
clarity. The matrix $G$ is called the Kalman gain.

Suppose that the network has $s$ outputs and $p$ weights. The size of matrix $C$ is $s$ by
$p$, and the size of $G$ is $p$ by $s$. Suppose we have $M$ patterns in a batch. If the
network is duplicated $M$ times, the resulting network will have $M \times s$ outputs. The
size of $C$ becomes $M \times s$ by $p$. Note that we simply concatenate $M$ matrices
together. Matrix $G$ can still be computed from Eq. (4), and its size becomes $p$ by
$M \times s$. The weight update can be done just like in the case of one pattern, except
that now the matrix $G$ encodes information about all patterns.

This method is called multi-streaming [22], [23]. Increasing the number of input patterns
results in large sizes of $C$ and $G$. This makes the batch update inefficient because of
the need to invert large matrices. Therefore, larger problems will require more advanced
numerical techniques already used by practitioners of EKF training [24], [22].
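
The bookkeeping of the batch update can be sketched as follows: the per-pattern Jacobians are stacked row-wise and the per-pattern errors are concatenated, after which the single-pattern formulas apply unchanged. This is an illustrative sketch with hypothetical function names, a scalar noise level `R_diag`, and $Q = 0$.

```python
import numpy as np

def multistream_gain(K, C_list, R_diag):
    """Stack M Jacobians of shape (s, p) and return C (M*s, p) and G (p, M*s)."""
    C = np.vstack(C_list)
    R = R_diag * np.eye(C.shape[0])
    A = C @ K @ C.T + R                     # (M*s, M*s): this is the costly part
    G = np.linalg.solve(A, C @ K).T         # K C^T A^{-1}, using symmetry of K, A
    return C, G

def multistream_update(w, K, C_list, err_list, R_diag):
    """Batch EKF step over M patterns, with the process noise Q taken as 0."""
    C, G = multistream_gain(K, C_list, R_diag)
    delta = np.concatenate(err_list)        # (M*s,)
    return w + G @ delta, K - G @ C @ K
```

The linear system solved here has size $M \times s$, which is why the cost grows quickly with the number of patterns in the batch.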

5 CSRN training algorithm 

The network architecture given in Fig. 2 is very generic and with proper implementation
can be easily adapted to different problems. The algorithm given in Fig. [8] describes
the training of the CSRN. The main loop calculates the Jacobian matrix $C$, which is used in
the Kalman weight update. We perform testing during each training period and decide
whether to stop training based on the testing results. The output transformation is the part
of the network that has to be customized for each problem. Other network parameters include
the network size, the cell size, the number of internal steps of the SRN, and the EKF
parameters $K$, $R$, and $Q$. The number of internal steps is selected large enough to allow
the typical network to settle to an equilibrium state. In the process of training the
network dynamics change, and sometimes the network no longer settles. Currently we do not
terminate training even if equilibrium is not reached, as such networks still achieve good
levels of generalization. The Matlab implementation is available from the authors.

6 Application of EKF Learning to Generalized Maze 
Navigation Problem 

6.1 Problem Description 

The generalized maze navigation problem consists of finding the optimal path from any
initial position to the goal in a 2D grid world. An example of such a world is illustrated
in Fig. 4. One version of an algorithm for solving this problem takes a representation of
the maze as its input and returns the length of the path from each clear cell to the goal.
For example, for a 5 by 5 maze the output consists of 25 numbers. Once we know these numbers,
it is very easy to find the optimal path from any cell by simply following the minimum among
the neighbors. Examples of such outputs are given in Fig. 5.
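
The two steps described above, computing the path length from every clear cell and then following the minimum among the neighbors, can be sketched as follows. The maze encoding ('#' for an obstacle, '.' for a clear cell) and the function names are assumptions made for this illustration. With a unit cost per move, a breadth-first sweep from the goal gives the same values as dynamic programming.

```python
from collections import deque

MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))

def path_lengths(maze, goal):
    """Return {cell: number of steps to goal} for every reachable clear cell."""
    rows, cols = len(maze), len(maze[0])
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in MOVES:
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and maze[nr][nc] != '#' and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist

def follow_minimum(dist, start):
    """Walk from start to the goal by always moving to the smallest neighbor."""
    path = [start]
    while dist[path[-1]] > 0:
        r, c = path[-1]
        neighbors = [(r + dr, c + dc) for dr, dc in MOVES]
        path.append(min((n for n in neighbors if n in dist), key=dist.get))
    return path
```

For a maze given as a list of strings, `path_lengths(maze, goal)` plays the role of the target J values, and `follow_minimum` recovers one optimal path from any reachable cell.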

2D maze navigation is a very simple representative of a broad class of problems solved
using the techniques of Dynamic Programming, which means finding the J cost-to-go
function using Bellman's equation (see for example [21]). Dynamic Programming gives the
exact solution to multistage decision problems. More precisely, given a Markovian decision
process with $N$ possible states and the immediate expected cost of transition between any
two states $i$ and $j$ denoted by $c(i,j)$, the optimal cost-to-go function for each state
satisfies the following Bellman optimality equation.

$$J^*(i) = \min_{\mu} \sum_{j=1}^{N} p_{ij}(\mu) \left[ c(i,j) + \gamma J^*(j) \right] \qquad (8)$$

$J(i)$ is the total expected cost from the initial state $i$, and $\gamma$ is the discount
factor. The cost $J$ depends on the policy $\mu$, which is the mapping between the states and
the actions causing state transitions. The optimal expected cost results from the optimal
policy $\mu^*$. Finding such a policy directly from Eq. (8) is possible using recursive
techniques, but it becomes computationally expensive as the number of states of the problem
grows. In the case of the 2D maze, the immediate cost $c(i,j)$ is always 1, and the
probabilities $p_{ij}$ can only take the values 0 or 1.
The J surface resulting from the 2D maze is a challenging function to be approximated
by a neural network. It has been shown that an MLP cannot solve the generalized problem
[2]. Therefore, this is a good problem on which to demonstrate the power of the Cellular
SRN's. It has been shown that a CSRN is capable of solving this problem when its weights are
designed in a certain way [25]. However, the challenge is to train the network to do the same.

The Cellular SRN that solves the m by m maze problem consists of an (m+2) by (m+2) grid of
identical units. The extra row and column on each side result from introducing walls
around the maze, which prevent the agent from running away. Each unit receives input
from the corresponding cell of the maze and returns the value of the J function for this cell.
There are two inputs for each cell: one indicates whether this is a clear cell or an obstacle,
and the other supplies the value of the goal. As shown in Fig. 2, the number of outputs
of the cellular part of the network equals the number of cells. In the maze application
the final output is the value of the J function for each input cell, and therefore there is
no need for the output transformation.

6.2 Results of 2D Maze Navigation 

Previous results of training the Cellular SRN's showed slow convergence [2]. Those
experiments used back-propagation with adaptive learning rate (ALR) [1]. The network
consisted of 5 recurrent nodes in each cell and was trained on up to 6 mazes. The initial
results demonstrated the ability of the network to learn the mazes [13].

The introduction of EKF significantly sped up the training of the Cellular SRN. In
the case of a single maze, the network reliably converges within 10-20 training cycles (see
Fig. [8]). In comparison, back-propagation through time with adaptive learning rate (ALR)
takes between 500 and 1000 training cycles and is more dependent on the initial network
weights [2].

We discovered that increasing the number of recurrent nodes from 5 to 15 speeds up
both EKF and ALR training in the case of multiple mazes. Nevertheless, the EKF
has a clear advantage. For a more realistic learning task we use 30 training mazes and
test the network with 10 previously unseen mazes. The training targets were computed
using a dynamic programming algorithm. Fig. 7A shows the sum squared error as a function of
the training step. We can see that EKF reaches a reasonable level in 150 training cycles.
For comparison, the back-propagation through time with adaptive learning rate training is
shown on the same graph.

The true solution consists of integer values with a difference of one between
neighboring cells. We suppose that the approximation is reasonable if the maximum error
per cell is less than 0.5, since in this case the correct differences will be preserved. This
means that for a 7 by 7 network corresponding to a 5 by 5 maze, the sum squared error has
to fall below $49 \times 0.5^2 = 12.25$. In Fig. 7A the EKF drops below the 12.25 level within
150 steps, while ALR testing saturates at a level close to 50. We use 20 internal steps
within each training cycle. In practical training scenarios the error is obviously not the
same for each cell. Detailed statistical analysis can reveal the true nature of the expected
distributions. In the present exploratory study we do not go into the details of statistics.
Rather, we introduce an empirical measure of the goodness of the learnt navigation task in
the following way. We count how many gradients point in the correct direction. The
ratio of the number of correct gradients to the total number of gradients is our goodness
ratio G, which can vary from 0 to 100 percent. The gradient of the J function gives the
direction of the next move. As an example, Fig. 5 shows the J function computed by a
network and the true J function. Fig. 5 demonstrates 2 erroneous gradient directions. The
goodness G is illustrated in Fig. 7B. We can see that EKF reaches a testing performance of
75-80 percent after 150 training cycles, averaged over 10 testing mazes. On the other hand,
BP/ALR testing performance lingers around the 50 percent chance level for several hundred
training cycles. Even after 500 training cycles it is close to the chance level. This shows
the potential of EKF for training CSRN's.
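
One possible reading of the goodness ratio is sketched below; the text does not spell out the exact counting rule, so the details here are assumptions. A cell counts as correct when the neighbor with the smallest predicted J is also a neighbor with the smallest true J. The goal cell and obstacle cells are skipped, and the function name and the boolean `clear` mask are hypothetical.

```python
import numpy as np

MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))

def goodness_ratio(J_pred, J_true, clear):
    """Percentage of clear, non-goal cells whose predicted move is optimal."""
    rows, cols = J_true.shape
    correct = total = 0
    for r in range(rows):
        for c in range(cols):
            if not clear[r, c] or J_true[r, c] == 0:
                continue                      # skip obstacles and the goal
            nbrs = [(r + dr, c + dc) for dr, dc in MOVES
                    if 0 <= r + dr < rows and 0 <= c + dc < cols
                    and clear[r + dr, c + dc]]
            if not nbrs:
                continue
            chosen = min(nbrs, key=lambda n: J_pred[n])   # predicted next move
            best_true = min(J_true[n] for n in nbrs)
            total += 1
            correct += int(J_true[chosen] == best_true)
    return 100.0 * correct / total
```

Under this rule a perfect approximation scores 100 percent, and a prediction can score 100 percent even when its values are off, as long as the ordering among neighbors is preserved.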

7 Application of EKF to Connectedness Problem 
7.1 A Simple Connectedness Problem 

The description of the connectedness problem can be found in [10]. The problem consists of
answering the question "is the input pattern connected?". This question is fundamental to
our ability to segment visual images into separate objects, which is the first preprocessing
step before trying to recognize and classify the objects. We work with a subset of the
connectedness problem, where we consider a square image and ask the following question: "are
the top left and the bottom right corners connected?". Note that diagonal connections
do not count in our case: each pixel of a connected pattern has to have a neighbor on the
left, right, top or bottom. Examples of such images are given in Fig. 6. This subset is still
a difficult problem, which could not be solved by the feedforward network [26]. The reason
why connectedness is a difficult problem lies in its sequential nature. The human eye has to
follow the borders of an image sequentially in order to classify it. This explains the need
for recursion.

The network architecture for the connectedness problem is that of Fig. 2. The output
transformation is a GMLP with one output. The weights of this GMLP are randomly
generated and fixed. The target outputs are +0.5 for a connected pattern and -0.5 for
a disconnected pattern.
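
For reference, the target labels for this task can be produced by a sequential search, sketched below under the stated 4-neighbor rule. This is only an illustration of how training targets could be generated, not the network itself; the function names are hypothetical, and the image is assumed to be a 2D array with nonzero entries marking pattern pixels.

```python
import numpy as np

def corners_connected(img):
    """True if top-left and bottom-right pixels are joined by a 4-neighbor path."""
    rows, cols = img.shape
    if not (img[0, 0] and img[rows - 1, cols - 1]):
        return False
    seen = {(0, 0)}
    stack = [(0, 0)]
    while stack:
        r, c = stack.pop()
        if (r, c) == (rows - 1, cols - 1):
            return True
        # Diagonal moves are deliberately excluded.
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and img[nr, nc] and (nr, nc) not in seen:
                seen.add((nr, nc))
                stack.append((nr, nc))
    return False

def target_output(img):
    """Target value: +0.5 for a connected pattern, -0.5 otherwise."""
    return 0.5 if corners_connected(img) else -0.5
```

The search visits pixels one after another, which mirrors the sequential nature of the problem noted above.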

7.2 Results of Connectedness Problem 

Here we present the results of solving the subset of connectedness problem. We applied the 
network to image sizes 5, 6, and 7. In each case we generated sets of 30 random connected 
and 30 disconnected patterns for training, and 10 connected and 10 disconnected patterns 
for testing. We used 20 internal iterations and the training took between 100 and 200 
training cycles. We used the same EKF parameters as in the case of maze navigation. 
After training on 30 patterns we tested the network and calculated the percent of correctly 
classified patterns. We applied the same set of patterns to a feed-forward network with one 
hidden layer. The size of the hidden layer was varied to obtain the best results. The results 
are summarized in the following table, where each number is averaged over 10 experiments 
and the standard deviation is also given. 
Table I. Generalization with EKF learning for the connectedness problem

Input Size    Correctly Classified with CSRN    Correctly Classified with MLP
5x5           80 ± 6 %                          66 ± 10 %
6x6           82 ± 6 %                          65 ± 12 %
7x7           88.5 ± 6 %                        63 ± 12 %



We can see that the performance of the MLP is just slightly above chance level, whereas
the CSRN produces correct answers in 80-90% of test cases on previously unseen patterns.
This performance is likely to improve with fine tuning of the network parameters.

8 Discussion and Conclusions 

In this contribution we presented the advantages of using the Cellular SRN's as a more
general topology compared to the conventional MLP. We extended the previous results [27]
by applying the CSRN to the connectedness problem. We applied an efficient learning
methodology - EKF - to the SRN and obtained very encouraging results in terms of speed
of convergence. The unit of our network is a generalized MLP. It can be easily replaced by
any other feed-forward computation suitable for the problem at hand, without any changes to
the Cellular SRN. It now becomes practical to apply the proposed combination of architecture
and training method to any data that has a 2D grid structure. The network size does
not grow exponentially with the input size because of the weight sharing. A 100 by
100 input pattern could potentially be processed by the CSRN with 15 units in each cell.
However, large networks still involve massive computations, which can possibly be addressed
by efficient hardware implementations [28].

One example of such an application is image processing. Detecting connectedness is a
fundamental challenge in this field. We applied our CSRN to a subset of the connectedness
problem with minimal changes to the code. The results showed that the CSRN is much better
at recognizing connectedness than the feed-forward architecture.

Another example of such data is board games. The games of chess and checkers
have long been used as testing problems for AI. Recently, neural networks coupled with
evolutionary training methods have been successfully applied to checkers [29] and chess
[30]. The neural network architecture used in those works is a case of the Object Net [12],
mentioned in the introduction. The input pattern (the chess board) is divided into spatial
components, and the network is built with separate sub-units receiving input from their
corresponding components. The interconnections between the sub-units of the network
encode the spatial relationships between different parts of the board. The outputs of the
Object Net feed into another multilayered network used to evaluate the overall "fitness"
of the current situation on the board.

From the above description we can see that the CSRN is a simplified case of the
Object Net. The Chess Object Net belongs to the same class of multistage optimization
problems, even though it does not presently use recurrent units. The biggest difference,
however, is the training method. Evolutionary computation has proven to be able
to solve the problem, but at a high computational cost. The approach used in this
contribution provides an efficient training method for the Object Net type of networks,
with more biologically plausible training using local derivative information. The improved
efficiency allows the use of SRN's, which are proven to be more powerful in function
approximation than the MLP's. Therefore, the Cellular SRN/EKF can be applicable to many
interesting problems.

This contribution builds upon several concepts: Recurrent NN's, Simultaneous Recurrent
Networks, Cellular NN's, Dynamic Programming, and Kalman Filters. After reviewing
each of the concepts, we presented a versatile recurrent neural network architecture
capable of efficient training based on the EKF methodology. We demonstrated this novel
application of EKF on the examples of the maze navigation problem and the connectedness
problem. A detailed study of the properties of EKF for CSRN training is in progress.
Appendix A: Example of Calculating Ordered Derivatives

The following example is an illustration of the principles mentioned in Section 2. Consider
the network in Fig. 11A. This network can be decomposed into 4 identical units as shown in
Fig. 11B, where each unit is a mapping between its inputs, internal parameters and outputs.
This is a case of a simple recurrent cellular network with two cells and two iterations. We
unfold the recurrent steps to demonstrate the application of the rule of ordered derivatives.
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Each unit has 3 inputs Xi,X3, and x^, and 3 parameters a, b, and c. The outputs of 
each neuron are denoted by zi, Z2, and Z3. The first input neuron does not perform any 
transformation, so Zi = Xi. The second and third neurons use a nonhnear transformation 
/. The forward calculation of an elementary unit is as follows. 



$$z_2 = x_2 + f(c x_1) \tag{A-1}$$

$$z_3 = x_3 + f(a x_1 + b z_2) \tag{A-2}$$

The order in which different quantities appear in the forward calculation is

$$x_1, x_2, x_3, c, z_2, a, b, z_3. \tag{A-3}$$

We would like to determine the derivatives of $z_2$ and $z_3$ w.r.t. the inputs and parameters.
To do so, we apply the rule of ordered derivatives [14], given by the following formula:

$$\frac{\partial^+ \mathrm{TARGET}}{\partial z_i} = \frac{\partial\, \mathrm{TARGET}}{\partial z_i} + \sum_{j>i} \frac{\partial^+ \mathrm{TARGET}}{\partial z_j} \frac{\partial z_j}{\partial z_i} \tag{A-4}$$

where TARGET is the variable whose derivative w.r.t. $z_i$ is sought, and the calculation
of TARGET involves using the $z_j$'s in the order of their subscripts. The notation $\partial^+$
is used for the ordered derivative, which simply means the full derivative, as opposed to the
simple partial derivative obtained by considering only the final equation involving TARGET.

In order to calculate the derivatives in our example, we have to use equation (A-4)
in reverse order of (A-3). Let $\phi$ denote the derivative of $f$.



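$$\frac{\partial^+ z_3}{\partial b} = \frac{\partial z_3}{\partial b} = z_2\, \phi(a x_1 + b z_2) \tag{A-5}$$

$$\frac{\partial^+ z_3}{\partial a} = \frac{\partial z_3}{\partial a} = x_1\, \phi(a x_1 + b z_2) \tag{A-6}$$

$$\frac{\partial^+ z_3}{\partial z_2} = \frac{\partial z_3}{\partial z_2} = b\, \phi(a x_1 + b z_2) \tag{A-7}$$

$$\frac{\partial^+ z_3}{\partial c} = \frac{\partial^+ z_3}{\partial z_2} \frac{\partial^+ z_2}{\partial c} = b\, \phi(a x_1 + b z_2)\, x_1\, \phi(c x_1) \tag{A-8}$$

$$\frac{\partial^+ z_3}{\partial x_3} = \frac{\partial z_3}{\partial x_3} = 1 \tag{A-9}$$

$$\frac{\partial^+ z_3}{\partial x_2} = \frac{\partial^+ z_3}{\partial z_2} \frac{\partial z_2}{\partial x_2} = b\, \phi(a x_1 + b z_2) \tag{A-10}$$

$$\frac{\partial^+ z_3}{\partial x_1} = \frac{\partial z_3}{\partial x_1} + \frac{\partial^+ z_3}{\partial z_2} \frac{\partial z_2}{\partial x_1} = a\, \phi(a x_1 + b z_2) + b\, \phi(a x_1 + b z_2)\, c\, \phi(c x_1) \tag{A-11}$$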
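$$\frac{\partial^+ z_2}{\partial c} = \frac{\partial z_2}{\partial c} = x_1\, \phi(c x_1) \tag{A-12}$$

$$\frac{\partial^+ z_2}{\partial x_3} = \frac{\partial z_2}{\partial x_3} = 0 \tag{A-13}$$

$$\frac{\partial^+ z_2}{\partial x_2} = \frac{\partial z_2}{\partial x_2} = 1 \tag{A-14}$$

$$\frac{\partial^+ z_2}{\partial x_1} = \frac{\partial z_2}{\partial x_1} = c\, \phi(c x_1) \tag{A-15}$$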
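As a numerical check of the unit-level derivatives, the following minimal Python sketch evaluates the ordered derivatives of $z_2$ and $z_3$ in the reverse of order (A-3) and compares them with central finite differences. The choice $f = \tanh$, the function names, and the test values are assumptions made for illustration only; they are not taken from the original implementation.

```python
import math

def f(u):
    # Assumed nonlinearity; the text only requires f to be differentiable.
    return math.tanh(u)

def phi(u):
    # Derivative of the assumed f = tanh.
    return 1.0 - math.tanh(u) ** 2

def forward(x1, x2, x3, a, b, c):
    z2 = x2 + f(c * x1)            # (A-1)
    z3 = x3 + f(a * x1 + b * z2)   # (A-2)
    return z2, z3

def ordered_derivatives(x1, x2, x3, a, b, c):
    """Ordered derivatives of z2 and z3, visited in reverse of order (A-3)."""
    z2, _ = forward(x1, x2, x3, a, b, c)
    p3 = phi(a * x1 + b * z2)
    p2 = phi(c * x1)
    # Derivatives of z2: it depends only on c, x2 and x1.
    dz2 = {"c": x1 * p2, "x3": 0.0, "x2": 1.0, "x1": c * p2}
    # Derivatives of z3: direct terms first, then terms routed through z2.
    dz3_dz2 = b * p3
    dz3 = {"b": z2 * p3, "a": x1 * p3}
    dz3["c"] = dz3_dz2 * dz2["c"]
    dz3["x3"] = 1.0
    dz3["x2"] = dz3_dz2 * dz2["x2"]
    # x1 reaches z3 both directly (through a*x1) and indirectly (through z2).
    dz3["x1"] = a * p3 + dz3_dz2 * dz2["x1"]
    return dz2, dz3

def numeric(name, index, args, h=1e-6):
    """Central finite difference of output `index` (0: z2, 1: z3) w.r.t. `name`."""
    names = ["x1", "x2", "x3", "a", "b", "c"]
    i = names.index(name)
    hi, lo = list(args), list(args)
    hi[i] += h
    lo[i] -= h
    return (forward(*hi)[index] - forward(*lo)[index]) / (2 * h)
```

Calling `ordered_derivatives(0.3, -0.2, 0.5, 0.7, -1.1, 0.4)` and comparing each entry with `numeric(...)` at the same point is one way to confirm that the reverse sweep reproduces the full derivatives.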



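Knowing these derivatives, what are the derivatives of the full network? Let us add a
superscript to each variable indicating which unit of the network it belongs to. Note that
the outputs of the earlier units become the inputs of the later units. Consider unit 2, which
gets input from units 1 and 3. Let $z^1$ and $z^3$ denote the outputs of units 1 and 3 that
feed unit 2, and let $x_p^2$ and $x_q^2$ denote the inputs of unit 2 that receive them, with the
indices fixed by the connections in Fig. 9B. If we apply (A-4) to obtain the derivative of, for
example, the output $z^2$ of unit 2 w.r.t. $a$, we get, based on the topology of connections
between the units, the following result.

$$\frac{\partial^+ z^2}{\partial a} = \frac{\partial z^2}{\partial a^2} + \frac{\partial^+ z^2}{\partial z^1} \frac{\partial^+ z^1}{\partial a^1} + \frac{\partial^+ z^2}{\partial z^3} \frac{\partial^+ z^3}{\partial a^3}$$

Obviously $a^1 = a^2 = a^3 = a$, as we use identical units. The quantities
$\partial^+ z^1 / \partial a^1$ and $\partial^+ z^3 / \partial a^3$ are already obtained for each unit.
Since $z^1 = x_p^2$ and $z^3 = x_q^2$, the quantities $\partial^+ z^2 / \partial z^1$ and
$\partial^+ z^2 / \partial z^3$ are equivalent to $\partial^+ z^2 / \partial x_p^2$ and
$\partial^+ z^2 / \partial x_q^2$, which are also already calculated for each individual unit. They
are the input "deltas", or the output derivatives "propagated" backwards through the unit.
In other words, once all the quantities of each individual unit are calculated, the
total derivatives of the outputs of the full network w.r.t. any parameter are obtained by
summing the individual units' derivatives, each multiplied by the corresponding "delta". The
correspondence is determined by the topology of connections, that is, by knowing which output
is connected to which input. Every time we backpropagate through a unit, we also set the
values of the "deltas" of the preceding units. In our example:

$$\delta^1 = \frac{\partial^+ z^2}{\partial z^1} = \frac{\partial^+ z^2}{\partial x_p^2} \tag{A-19}$$

$$\delta^3 = \frac{\partial^+ z^2}{\partial z^3} = \frac{\partial^+ z^2}{\partial x_q^2} \tag{A-20}$$

And

$$\frac{\partial^+ z^2}{\partial a} = \frac{\partial z^2}{\partial a^2} + \delta^1 \frac{\partial^+ z^1}{\partial a^1} + \delta^3 \frac{\partial^+ z^3}{\partial a^3} \tag{A-21}$$

Likewise, in the general case, the final calculation for any network output $j$ is a simple
summation of this form, as given by the general equation in the main text.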
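The summation over units can be illustrated with a small Python sketch. It assumes a hypothetical wiring in which the output $z_3$ of unit 1 feeds input $x_2$ of unit 2 and the output $z_3$ of unit 3 feeds input $x_3$ of unit 2 (the actual connections are those of Fig. 9B), and it assumes $f = \tanh$. All function and variable names are illustrative.

```python
import math

def unit(x1, x2, x3, a, b, c):
    """Forward pass (A-1), (A-2) and the unit-level derivatives of z3 needed below."""
    z2 = x2 + math.tanh(c * x1)
    s = a * x1 + b * z2
    z3 = x3 + math.tanh(s)
    p3 = 1.0 - math.tanh(s) ** 2          # phi evaluated at a*x1 + b*z2
    d = {"a": x1 * p3, "x2": b * p3, "x3": 1.0}
    return z3, d

def network(inputs1, inputs3, x1_unit2, a, b, c):
    """Output of unit 2 and its total derivative w.r.t. the shared parameter a.

    Assumed wiring: z3 of unit 1 -> x2 of unit 2, z3 of unit 3 -> x3 of unit 2.
    """
    out1, d1 = unit(*inputs1, a, b, c)
    out3, d3 = unit(*inputs3, a, b, c)
    out2, d2 = unit(x1_unit2, out1, out3, a, b, c)
    # Deltas: derivatives of the output of unit 2 w.r.t. the inputs fed by units 1 and 3.
    delta1 = d2["x2"]
    delta3 = d2["x3"]
    # Direct term of unit 2 plus each earlier unit's derivative times its delta.
    total = d2["a"] + delta1 * d1["a"] + delta3 * d3["a"]
    return out2, total
```

The returned `total` can be compared with a finite difference of `network(...)[0]` in `a` to confirm that the per-unit derivatives and the deltas are sufficient to recover the derivative of the full network.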
Figure Captions

Figure 1: Algorithm for calculating ordered derivatives in a recurrent neural network.
See Section 2 for explanations.

Figure 2: Generic Cellular SRN architecture.

Figure 3: Cell of the CSRN is a generalized MLP with m inputs and n outputs. The
solid lines are adjustable weights and the dashed lines are unit weights. Note that
the output of the cell is scaled by the output weight.

Figure 4: Example of a 5 by 5 maze world. Black squares are obstacles. X is the
location of the goal. The agent needs to find the shortest path from any white square
to the goal.

Figure 5: Comparison of the solution given by the network and the true solution. A -
approximate solution; black arrows point in the wrong direction. B - exact solution.

Figure 6: Examples of input patterns for the connectedness problem for a 7 by 7 image.

Figure 7: A. Average sum squared error for training on 30 mazes and testing on 10.
Solid - EKF training error, dotted - EKF testing error, dashed - ALR training error,
dash-dot - ALR testing error. The 12.25 threshold for the sum squared error is shown
by a solid line. B. Average Goodness of Navigation G for training on 30 mazes and
testing on 10. Solid - EKF training, dotted - EKF testing, dashed - ALR training,
dash-dot - ALR testing. The solid line at 50 percent is the chance level.

Figure 8: Pseudo-code for the training cycle of the CSRN.

Figure 9: Simple feedforward network (A) which can be divided into 4 blocks (B).
11 Figures

Calculation of derivatives in recurrent network

Initialize deltas
Initialize ordered derivatives
for each time step
    for each network output
        find deltas
        update derivatives

[Symbolic panel: initialization of the deltas and of the derivatives for k = 1..n, followed by
the loop for t = N-1 down to 1 and for k = 1 to n that finds the deltas and updates the
derivatives.]

Figure 1:
[Diagram label: Network Output]

Figure 2:

[Diagram labels: Output, Recurrent Link]

Figure 4:



Figure 5:



[Panel titles: Connected, Not connected]

Figure 6:



[Horizontal axis: Training/testing step; panel label: B]

Figure 7:



Training Cycle of CSRN with EKF weight update

Initialize network weights
Initialize EKF parameters Q, R, K
Repeat until training is completed
    Set Jacobian C to empty matrix
    For each training pattern
        Run forward update of CSRN
        Calculate network output(s) and error
        Backpropagate error through output transformation
        Backpropagate deltas from output transformation through CSRN
        Augment the Jacobian matrix C
    Calculate weight adjustments using EKF algorithm
    For each testing pattern (stopping criteria)
        Run forward update of CSRN
        Calculate network output
    Determine whether training is completed



Figure 8: 
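The training cycle of Figure 8 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the CSRN is replaced by a hypothetical one-neuron model y = tanh(w·x), the weight update is the textbook EKF form, and the weight covariance is named `P` (an assumed symbol; the figure lists only Q, R, K). The initial values of `P`, `Q`, and `R` are arbitrary choices.

```python
import numpy as np

def ekf_step(w, P, C, e, Q, R):
    """One EKF weight update.

    w: (n,) weights; P: (n, n) weight covariance; C: (m, n) Jacobian of the
    m stacked outputs w.r.t. the weights; e: (m,) target minus output.
    """
    S = C @ P @ C.T + R                 # innovation covariance, (m, m)
    K = P @ C.T @ np.linalg.inv(S)      # Kalman gain, (n, m)
    w = w + K @ e
    P = P - K @ C @ P + Q
    return w, P

def train(X, t, epochs=20):
    """Repeat: build the Jacobian over all patterns, then apply one EKF update."""
    n = X.shape[1]
    w = np.zeros(n)
    P = np.eye(n) * 100.0               # assumed initial covariance
    Q = np.eye(n) * 1e-4                # assumed process noise
    R = np.eye(len(t))                  # assumed measurement noise
    errors = []
    for _ in range(epochs):
        y = np.tanh(X @ w)              # forward pass of the toy model
        e = t - y
        errors.append(float(e @ e))     # sum squared error before the update
        C = (1.0 - y ** 2)[:, None] * X # one Jacobian row per training pattern
        w, P = ekf_step(w, P, C, e, Q, R)
    return w, errors
```

For example, with `X` drawn from `np.random.default_rng(0).normal(size=(30, 3))` and targets `np.tanh(X @ w_true)` for some chosen `w_true`, `train(X, t)` returns the fitted weights and the history of the sum squared error. In the CSRN setting the rows of `C` would instead come from the backpropagated deltas described in Appendix A.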
Figure 9: